feat(agent): persistent caching for agent #59
Greptile Overview
Greptile Summary
This PR adds persistent caching for agent dynamic secret leases using Badger DB with AES-GCM encryption, enabling the agent to recover leases across Kubernetes pod restarts. Minor improvements include
Key Changes:
Critical Security & Logic Issues Found:
Recommendations:
Confidence Score: 2/5
Important Files Changed
File Analysis
    // lease not found in API, delete it from cache and return nil
    if errors.Is(err, api.ErrNotFound) {
        log.Warn().Msgf("dynamic secret lease does not exist, deleting from cache: [lease-id=%s]", leaseFromCache.LeaseID)
        if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
            log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
        }

        return nil
    }

    // lease is found in cache but not in the the API, and the API returned a non 404-error. We should attempt to revoke it
    // at this point we know that we should be able to reach the API because we've done authentication successfully
    log.Warn().Msgf("unable to get dynamic secret lease from API. Revoking lease from cache: [lease-id=%s]", leaseFromCache.LeaseID)
    if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
        log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
    }

    if err := revokeDynamicSecretLease(accessToken, leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.LeaseID); err != nil {
        log.Warn().Msgf("unable to revoke dynamic secret lease %s: %v", leaseFromCache.LeaseID, err)
        return nil
    }
We delete the lease from the cache in both branches anyway, so the `if` should only decide whether to revoke or not:
    - // lease not found in API, delete it from cache and return nil
    - if errors.Is(err, api.ErrNotFound) {
    -     log.Warn().Msgf("dynamic secret lease does not exist, deleting from cache: [lease-id=%s]", leaseFromCache.LeaseID)
    -     if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
    -         log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
    -     }
    -     return nil
    - }
    - // lease is found in cache but not in the the API, and the API returned a non 404-error. We should attempt to revoke it
    - // at this point we know that we should be able to reach the API because we've done authentication successfully
    - log.Warn().Msgf("unable to get dynamic secret lease from API. Revoking lease from cache: [lease-id=%s]", leaseFromCache.LeaseID)
    - if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
    -     log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
    - }
    - if err := revokeDynamicSecretLease(accessToken, leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.LeaseID); err != nil {
    -     log.Warn().Msgf("unable to revoke dynamic secret lease %s: %v", leaseFromCache.LeaseID, err)
    -     return nil
    - }
    + if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
    +     log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
    + }
    + // only attempt to revoke if the lease exists (not a 404 error)
    + // if it's 404, the lease doesn't exist so there's nothing to revoke
    + if !errors.Is(err, api.ErrNotFound) {
    +     log.Warn().Msgf("unable to get dynamic secret lease from API. Attempting to revoke lease: [lease-id=%s]", leaseFromCache.LeaseID)
    +     if err := revokeDynamicSecretLease(accessToken, leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.LeaseID); err != nil {
    +         log.Warn().Msgf("unable to revoke dynamic secret lease %s: %v", leaseFromCache.LeaseID, err)
    +     }
    + } else {
    +     log.Warn().Msgf("dynamic secret lease does not exist in API: [lease-id=%s]", leaseFromCache.LeaseID)
    + }
Not sure I'm following. What we want is:
- If 404: delete from cache and don't even try to revoke. Just return right away.
- If not 404: delete from cache, and try to revoke.
I feel like both versions achieve the same thing, but what we have currently reads a bit easier?
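To make the two cases concrete, here is a minimal, self-contained sketch of that control flow. It is not the PR's code: `leaseCleaner`, `recordingCleaner`, `handleStaleLease`, `errNotFound`, and `errUpstream` are hypothetical names, with `errNotFound` standing in for `api.ErrNotFound`.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for api.ErrNotFound from the PR (assumption);
// errUpstream represents any other, non-404 API failure.
var (
	errNotFound = errors.New("lease not found")
	errUpstream = errors.New("upstream failure")
)

// leaseCleaner is a hypothetical interface for the two side effects discussed above.
type leaseCleaner interface {
	DeleteFromCache(leaseID string) error
	Revoke(leaseID string) error
}

// recordingCleaner records every call so the control flow can be inspected.
type recordingCleaner struct{ calls []string }

func (r *recordingCleaner) DeleteFromCache(id string) error {
	r.calls = append(r.calls, "delete:"+id)
	return nil
}

func (r *recordingCleaner) Revoke(id string) error {
	r.calls = append(r.calls, "revoke:"+id)
	return nil
}

// handleStaleLease runs after fetching a cached lease from the API failed with fetchErr.
func handleStaleLease(c leaseCleaner, leaseID string, fetchErr error) {
	// Both cases drop the cache entry.
	if err := c.DeleteFromCache(leaseID); err != nil {
		fmt.Printf("unable to delete lease from cache: %v\n", err)
	}

	// 404: the lease no longer exists, so there is nothing to revoke.
	if errors.Is(fetchErr, errNotFound) {
		return
	}

	// Any other error: the lease may still exist upstream, so try to revoke it.
	if err := c.Revoke(leaseID); err != nil {
		fmt.Printf("unable to revoke lease %s: %v\n", leaseID, err)
	}
}

func main() {
	c := &recordingCleaner{}
	handleStaleLease(c, "lease-a", errNotFound)
	handleStaleLease(c, "lease-b", errUpstream)
	fmt.Println(c.calls)
}
```

Either shape (two explicit branches, or one shared delete followed by a conditional revoke) produces the same sequence of calls; the difference is only in how it reads.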
    if errors.Is(err, api.ErrNotFound) {
        log.Warn().Msgf("dynamic secret lease does not exist, deleting from cache: [lease-id=%s]", leaseFromCache.LeaseID)
        if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
            log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
        }

        return nil
    }

    // lease is found in cache but not in the the API, and the API returned a non 404-error. We should attempt to revoke it
    // at this point we know that we should be able to reach the API because we've done authentication successfully
    log.Warn().Msgf("unable to get dynamic secret lease from API. Revoking lease from cache: [lease-id=%s]", leaseFromCache.LeaseID)
    if err := d.DeleteLeaseFromCache(leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.Slug, leaseFromCache.TemplateID); err != nil {
        log.Warn().Msgf("[cache]: unable to delete lease from cache: %v", err)
    }

    if err := revokeDynamicSecretLease(accessToken, leaseFromCache.ProjectSlug, leaseFromCache.Environment, leaseFromCache.SecretPath, leaseFromCache.LeaseID); err != nil {
        log.Warn().Msgf("unable to revoke dynamic secret lease %s: %v", leaseFromCache.LeaseID, err)
        return nil
Suggestion: we could run this whole release-and-revoke process in a goroutine so the function is not blocked while these operations complete.
It wouldn't help in the rate-limiting situation, but in all other cases we could take advantage of it.
Hmm, how is that? When we return nil from GetLease, the caller treats the lease as nonexistent and provisions a new one. If we moved this to a goroutine, we could hit race conditions and cases where GetLease returns a lease even though it no longer exists. That could be counteracted with mutex locks and waits, but it wouldn't speed up execution and would make things more complicated to follow.
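The point about ordering can be illustrated with a small sketch. It assumes a simplified in-memory cache: `leaseCache`, `getLease`, and the `existsInAPI` callback are hypothetical and not taken from the PR. Because the stale entry is deleted under the lock before the function returns, a later lookup can never observe it; if the delete ran in a background goroutine instead, the second lookup below could still see the stale entry.

```go
package main

import (
	"fmt"
	"sync"
)

// leaseCache is a hypothetical, simplified stand-in for the agent's lease cache.
type leaseCache struct {
	mu      sync.Mutex
	entries map[string]string // cache key -> lease ID
}

// getLease returns the cached lease ID for key, or ok=false if there is no usable lease.
// existsInAPI is a hypothetical callback standing in for the API lookup.
func (c *leaseCache) getLease(key string, existsInAPI func(leaseID string) bool) (id string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()

	id, found := c.entries[key]
	if !found {
		return "", false
	}

	if !existsInAPI(id) {
		// Cleanup happens synchronously, while the lock is still held, so the
		// "no lease" result and the cache state cannot disagree.
		delete(c.entries, key)
		return "", false
	}

	return id, true
}

func main() {
	c := &leaseCache{entries: map[string]string{"db-creds": "lease-1"}}

	// First lookup: the API reports the lease as gone, so the entry is removed.
	_, ok := c.getLease("db-creds", func(string) bool { return false })
	fmt.Println("first lookup found lease:", ok)

	// Second lookup: the entry is already gone, regardless of what the API says.
	_, ok = c.getLease("db-creds", func(string) bool { return true })
	fmt.Println("second lookup found lease:", ok)
}
```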
victorvhs017 left a comment
Approved because this is time-sensitive. If we have time to apply the suggestions, great; if not, we can revisit them later!
    }

    // we call appendUnsafe because we already hold the lock, and if we call Append directly we'll get a deadlock
    d.appendUnsafe(*leaseFromCache)
Suggestion: `appendUnsafe` always saves the cache to the file.
But in this case, we just got the cache from the file.
And we don't update this value, which means we are just saving it again. Saving to the cache is an I/O operation; we could avoid this extra I/O by adding a boolean flag to `appendUnsafe` that controls whether it writes to the persistent cache.
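A minimal sketch of what such a flag could look like follows. The real `appendUnsafe` signature and the cache types are not shown in this thread, so `leaseManager`, `store`, `countingStore`, `restoreFromCache`, and the `persist` parameter are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// lease is a simplified stand-in for the agent's dynamic secret lease type.
type lease struct{ ID string }

// store is a hypothetical abstraction over the persistent cache.
type store interface {
	Save(l lease) error
}

// countingStore counts writes so the saved I/O is visible.
type countingStore struct{ writes int }

func (s *countingStore) Save(_ lease) error {
	s.writes++
	return nil
}

type leaseManager struct {
	mu     sync.Mutex
	leases []lease
	cache  store
}

// appendUnsafe must be called with mu held. persist controls whether the
// lease is also written to the persistent cache.
func (m *leaseManager) appendUnsafe(l lease, persist bool) {
	m.leases = append(m.leases, l)
	if !persist {
		return
	}
	if err := m.cache.Save(l); err != nil {
		fmt.Printf("[cache]: unable to save lease: %v\n", err)
	}
}

// Append registers a newly provisioned lease and persists it.
func (m *leaseManager) Append(l lease) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.appendUnsafe(l, true)
}

// restoreFromCache registers a lease that was just read from the persistent
// cache, so writing it back would be redundant.
func (m *leaseManager) restoreFromCache(l lease) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.appendUnsafe(l, false)
}

func main() {
	s := &countingStore{}
	m := &leaseManager{cache: s}
	m.Append(lease{ID: "new-lease"})
	m.restoreFromCache(lease{ID: "cached-lease"})
	fmt.Println("leases tracked:", len(m.leases), "cache writes:", s.writes)
}
```

An alternative with the same effect would be two separate helpers (one in-memory only, one that also persists), which avoids a boolean parameter at call sites.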




Description 📣
This PR adds persistent caching to the agent (specifically for Kubernetes). The agent can now persist dynamic secret leases across Kubernetes restarts. Access tokens are not stored in the cache; they continue to live in sinks (which are files), as they always have.
This is a prerequisite for adding hand-over support to the Agent Injector, so that the Agent Injector keeps managing the same dynamic secret leases when the user has configured it to run in both init and sidecar mode at the same time.
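The review summary at the top mentions that cached leases are encrypted with AES-GCM. As a rough illustration of that technique only, the sketch below seals and opens a payload with Go's standard library. How the PR derives its key, lays out the stored bytes, or talks to Badger DB is not shown here; `sealLease`, `openLease`, and the nonce-prefix layout are assumptions.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"errors"
	"fmt"
)

// newGCM builds an AES-GCM AEAD from a 16-, 24-, or 32-byte key.
func newGCM(key []byte) (cipher.AEAD, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// sealLease encrypts plaintext and returns nonce || ciphertext (assumed layout).
func sealLease(key, plaintext []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	// A fresh random nonce is required for every encryption under the same key.
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// openLease reverses sealLease and fails if the data was modified.
func openLease(key, sealed []byte) ([]byte, error) {
	gcm, err := newGCM(key)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, errors.New("ciphertext too short")
	}
	nonce, ciphertext := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ciphertext, nil)
}

func main() {
	key := make([]byte, 32) // AES-256; a real key must come from a secure source
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}

	sealed, err := sealLease(key, []byte(`{"leaseId":"example"}`))
	if err != nil {
		panic(err)
	}

	plain, err := openLease(key, sealed)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(plain))
}
```

Because GCM is authenticated, a cache entry that was tampered with or encrypted under a different key fails to open instead of yielding garbage.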
Minor:
- Support for setting the log level through the `LOG_LEVEL` environment variable instead of having to pass `--log-level`.
Type ✨
Tests 🛠️
# Here's some code block to paste some code snippets